ABSTRACT

Smith (1969) reported the results of an instrument
for measuring teacher judgment of written composition. His test was
first administered to a group of "experts" whose ratings were in high
agreement. Then the test was given to a sample of over 200 teachers
and lay readers. Among Smith's conclusions was that over half of the
teachers have judgment which differs significantly from that of the experts.
This study sought to determine if rater differences as measured by
Smith's test would remain constant for another set of essays. Six
raters were selected, on the basis of their scores on Forms A and B
of the Smith test, to read and score 71 seventh-grade essays. No
significant differences were observed between "good" and "bad"
raters. The results cast doubt on the validity of Smith's test as a
general instrument for assessing essay-rating behavior. Although the
test does appear to separate raters in terms of their rankings of
essays, and even though these rankings are relatively reliable,
differences between raters did not remain constant for another set of
essays. (Author/LR)
A VALIDATION OF THE SMITH TEST FOR MEASURING
TEACHER JUDGMENT OF WRITTEN COMPOSITION 



Thomas E. Whalen 



Literature on the measurement of writing ability is replete with evidence
of the unreliability and/or invalidity of reader evaluation of student writing.
Schumann (1968) stated that "research indicates that the youngster who has
neat penmanship will get at least a 'C' grade in composition work irrespective
of what he actually says" (p. 1163).



As early as 1921, Hopkins demonstrated that the score a student made on a
College Board examination might well depend more on which year he appeared for
the examination, or on which person read his paper, than it would on what he had
written. Godshalk and others (1966) presented a definitive review of the shifts
in College Board testing procedure from its inception. They concluded that the
two main sources of unreliability were (1) differences in quality of student
writing from one topic to another, and (2) the differences among readers in
what they consider the characteristics of good writing.

Evidence to support the second source of unreliability above was presented
by Diederich and French (1961). The authors conducted a factor analytic study
involving fifty-three readers from six different professional areas. The study
revealed five "schools of thought" with regard to measuring composition skill:
(1) ideas, (2) form, (3) flavor, (4) mechanics, and (5) wording.

The five reader-factors were identified by a "blind" classification of
11,018 comments written on 3,557 papers. The readers included college English
teachers, social scientists, writers and editors, lawyers, natural scientists,
and business executives. Ninety-four percent of the papers received seven or more
of the nine possible grades, and no paper received fewer than five different grades.
The median correlation between readers was .31. Readers in each field agreed
slightly better with the English teachers than with one another. Three College
Board tests taken by the student writers formed a separate test-factor that had
practically zero correlation with all reader-factors except mechanics (.50) and
wording (.45).

Despite this apparently overwhelming evidence of rater inconsistency,
efforts continue to be made, and rightly so, toward the achievement of more
reliable procedures for assessing students' writing. Smith (1969) reported the
results of an instrument for measuring teacher judgment in the evaluation of
written composition. He constructed a test to determine how well teachers agree
in their rating behavior with a set of expert English teachers. The test consists
of two forms, A and B, each containing five short essays taken from the Sequential
Tests of Educational Progress, Essay Test, and from other samples of actual student
writing. Raters are asked simply to rank the five essays on each form from best
to worst.

Smith first administered the test to a group of five "experts." The raters
in this group were all secondary English teachers "who had been formally recognized
as outstanding in the teaching of composition within their school districts or
by some outside agency" (p. 187). Impressive reliability coefficients were
reported for the expert raters. Inter-rater reliabilities (using Snedecor's
formula) ranged from .840 to .920 for two administrations of Forms A and B.
Reliabilities of average ratings ranged from .963 to .983. The test-retest
reliability was reported as 1.00 (p. 183).

The test was then administered to a sample of over 200 teachers and lay
readers to determine the extent of their agreement with the experts. Smith found
much greater variance among subjects in the sample population than among the
experts.
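
The expert-group coefficients quoted from Smith are numerically consistent with
one another: stepping up each single-rater reliability to the reliability of a
five-rater average with the Spearman-Brown formula reproduces the reported range.
The short Python sketch below illustrates this check. That Smith obtained his
averaged-rating figures in this way is an assumption, and the function name
`stepped_up_reliability` is hypothetical.

```python
def stepped_up_reliability(single_rater_r, n_raters):
    """Spearman-Brown reliability of the mean of n_raters ratings.

    Assumption: the averaged-rating reliabilities quoted from Smith can be
    related to the single-rater values this way; the source does not say so.
    """
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


# Five expert raters; .840 and .920 are the inter-rater values quoted from Smith.
low = stepped_up_reliability(0.840, 5)
high = stepped_up_reliability(0.920, 5)
print(round(low, 3), round(high, 3))  # 0.963 0.983
```

The agreement with the quoted .963 and .983 suggests, but does not prove, that
the two sets of figures describe the same underlying ratings.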




Among the conclusions reached by Smith were the following: 
1. Judgment as measured by this test is not related to experience,
academic background or professional training.

2. More than half the teachers disagree to some extent with the experts in
judgment as measured by this test.

3. Between ten and twenty percent of classroom teachers have judgment
that is contrary to that of the experts, and thus, "these persons are not
competent to make such judgments" (p. 190).

As possible applications for his test, Smith suggests its use as part of a battery
of tests to screen composition reader applicants, and as a tool to screen raters
in research when judgment in the evaluation of written composition is a factor.
It might also be used to provide individual and prospective teachers with knowledge
of their judgment in the evaluation of written composition (p. 193).

The purpose of the present study was to determine to what extent the results
of Smith's test can be generalized to other essay-rating situations. If
Smith's test can, indeed, provide valid measurements of rating behavior, then
differences between raters on his test should remain constant across other samples
of writing judged by the same raters.



METHOD 

Forms A and B of Smith's test were administered to thirty-three individuals
including nineteen elementary and secondary teachers and fourteen graduate students
in educational psychology. Scores from the two forms were combined (a procedure
suggested by Smith to increase reliability), producing a scale from zero to ten.
High scores (8, 9, or 10) indicated agreement with the experts; low scores (0
through 5) represented disagreement; scores of 6 and 7 indicated judgment that is
Six of the thirty-three raters were selected on the basis of their test
scores to read and score seventy-one seventh-grade essays. The essays were
gathered from three seventh-grade English classes of average ability. All
students wrote on the same topic: their reactions to the novel The Adventures
of Tom Sawyer. The essays were all approximately 200 words in length. None of
the raters was associated with the school from which the essays were selected
nor had any knowledge of the students whose essays he rated.

Four of the six raters were in high agreement with the experts on Smith's
test. They achieved scores of 8, 8, 9, and 10 on the test. Two of the raters were
in complete disagreement, having both received scores of four. A reliability
coefficient for average ratings (Ebel, 1950) was calculated for the group of four
who were in agreement. A second group of raters was formed which included the
two raters in disagreement with the experts and two who were in high agreement
(randomly selected from the previous group of four). The interjudge reliability
was calculated for the second group and was compared statistically with the
coefficient for the group in complete agreement. In addition, an intercorrelation
matrix of all six judges' ratings was generated. Coefficients in this matrix were
compared to determine the extent of agreement between individual raters.



For the sample of thirty-three teachers and graduate students, the
scoring range on Smith's test was from three through ten. The mean score was
6.27 with a standard deviation of 1.91. Fully two-thirds of the sample disagreed
to some extent with the experts (scores of seven or below). One-third of the
raters were in complete disagreement (a score of five or less). A comparison of
mean scores for teachers versus graduate students showed no appreciable difference
between groups: 6.36 and 6.21, respectively. In general, these findings were
in accord with the results of Smith's research except that a somewhat greater
percentage of persons in this sample had judgment contrary to that of the experts.



RESULTS AND DISCUSSION
The results of the analysis of variance to determine the reliability of
averaged scores for both groups are given in Table 1. This analysis is appropriate
when the raters' scores are eventually averaged and is designed to eliminate
variance due to the raters' operating at different means. In this study, the
readers were asked to rate each of the seventy-one essays by assigning a grade of "A",
"B", "C", or "D". These ratings were then quantified on a 4-point scale.

The reliability for the four judges in agreement was .79. For the mixed
group composed of two judges in agreement and two in disagreement, the coefficient
of reliability was .84. Thus, a higher reliability was calculated for the group
whose members had demonstrated opposing views with regard to the essays on Smith's
test. These two coefficients were compared (Lordahl, 1967) and were found not to
differ significantly.
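
The two coefficients can be reproduced from the sums of squares and degrees of
freedom listed in Table 1, using the formula printed there: reliability equals
one minus the ratio of the error mean square to the essay mean square. The
following Python sketch is an illustration only; the function name
`reliability_of_average` is hypothetical and is not part of the original analysis.

```python
def reliability_of_average(ss_essays, df_essays, ss_error, df_error):
    """Reliability of averaged ratings: 1 - (MS error / MS essays)."""
    ms_essays = ss_essays / df_essays
    ms_error = ss_error / df_error
    return 1.0 - ms_error / ms_essays


# Sums of squares and degrees of freedom as listed in Table 1.
group_1 = reliability_of_average(132.92, 70, 82.58, 210)  # four judges in agreement
group_2 = reliability_of_average(149.77, 70, 73.80, 210)  # mixed group
print(round(group_1, 2), round(group_2, 2))  # 0.79 0.84
```

Because the rater sum of squares is left out of the error term, differences in
the judges' mean grade levels do not lower the coefficient, which matches the
stated purpose of the analysis.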

Table 2 shows the correlation matrix for all six judges' scores. Judges
5 and 6 were the two in disagreement with the experts. Judges 2 and 4 were those
selected for inclusion in Group II. A comparison of coefficients in the matrix
indicated that Judge 5, who scored low on the Smith test, was in high agreement
with Judges 2 and 4, who scored high on the test. In fact, the average correlation
(using Fisher's z-transformation) between this low-scoring judge and the two
high-scoring judges was greater than the correlation between the two high-scoring
judges themselves. Judge 6 disagreed to a greater extent with Judges 2 and 4.
However, his average correlation with the high-scoring judges indicated that he
also agreed with them to a greater extent than they agreed with one another.
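
Averaging correlations through Fisher's z-transformation means converting each r
to z = arctanh(r), taking the mean of the z values, and converting the mean back
with tanh. The Python sketch below applies this to coefficients read from
Table 2. The assignment of each coefficient to a particular pair of judges is an
assumption based on reading the table as an upper-triangular matrix, and the
function name `mean_correlation` is hypothetical.

```python
import math


def mean_correlation(correlations):
    """Average correlations by way of Fisher's z-transformation."""
    z_values = [math.atanh(r) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))


# Assumed pairings, read from Table 2 as an upper-triangular matrix.
r_judges_2_and_4 = 0.52                            # the two high-scoring judges
judge_5_with_high = mean_correlation([0.65, 0.63])  # Judge 5 with Judges 2 and 4
judge_6_with_high = mean_correlation([0.67, 0.40])  # Judge 6 with Judges 2 and 4

print(round(judge_5_with_high, 2), round(judge_6_with_high, 2))
print(judge_5_with_high > r_judges_2_and_4, judge_6_with_high > r_judges_2_and_4)
```

Under these assumed pairings both averages exceed .52, in line with the
comparison described in the text.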

The evidence in this investigation casts doubt on the validity of Smith's
test as a general instrument for assessing essay-rating behavior. Although the
test does appear to separate raters in terms of their rankings of the ten essays,
and even though these rankings are relatively reliable measures (.87 for
test-retest using combined scores from Forms A and B), differences between raters did
not remain constant for another set of essays judged by the same raters. Additional
research is necessary before this test should be applied seriously to any of the
TABLE 1

ANALYSIS OF VARIANCE RELIABILITIES OF
AVERAGED RATINGS FOR ESSAY GRADES


GROUP I
(In Agreement)

Source       DF        SS        MS

Essays       70     132.92      1.90
Raters        3       4.42      1.47
Error       210      82.58      0.39
Total       283     219.92

Reliability = 1 - (MS error/MS essays) = 0.79


GROUP II
(In Disagreement)

Source       DF        SS        MS

Essays       70     149.77      2.14
Raters        3       3.20      1.07
Error       210      73.80      0.35
Total       283     226.77

Reliability = 1 - (MS error/MS essays) = 0.84





TABLE 2

Intercorrelations of Six Raters

Rater      1      2      3      4      5      6

  1       ---    .44    .51    .50    .51    .39
  2              ---    .46    .52    .65    .67
  3                     ---    .52    .60    .4?
  4                            ---    .63    .40
  5                                   ---    .52
  6                                          ---